In [ ]:
import pandas as pd
import plotly.express as px

df = pd.read_csv('ViewingActivityModified.csv')

Let's drop columns that we don't need and convert some columns so we can see how many hours we've spent watching Netflix. I've also modified user names to give other users a bit of privacy because this report can be found online.¶

In [ ]:
df = df.drop(columns=['Attributes', 'Supplemental Video Type', 'Bookmark', 'Latest Bookmark', 'Country'])
df['Start Time'] = pd.to_datetime(df['Start Time'], utc=True)
df['Duration'] = pd.to_timedelta(df['Duration']).dt.seconds/3600 #Convert duration to timedelta format and then to minutes

Which profile has the most interactions? Interactions represent number of times content was opened. For example, if you open a movie, then close it and resume watching later then number of interactions would be 2.¶

In [ ]:
df['Profile Name'].value_counts()
Out[ ]:
User 2      5832
User 4      5041
User 1      4224
Kristers    3049
User 3      2962
Name: Profile Name, dtype: int64

Total watchtime on all profiles in hours. This number could be represented in days but I feel like hours are a bit more comprehensible and I can graph this data without changing formats.¶

In [ ]:
round(df['Duration'].sum(), 2)
Out[ ]:
6019.61

Now that we know the total amount of hours we can find out what percentage of this number each user takes up.¶

In [ ]:
fig1 = px.pie(df, names='Profile Name', values='Duration', title='Percentage of total watchtime for each profile.')
fig1.show(renderer='notebook')

I want to represent this data with devices added to each user on which they spent this time. Let's find out device names listed in this data frame and replace them with categories for each device.¶

In [ ]:
df["Device Type"].unique()
Out[ ]:
array(['Samsung CE 2021 Nike-L UHD TV  Smart TV', 'Apple iPhone 13',
       'Chrome PC (Cadmium)', 'LG 2017 RTK K3Lp Standard UHD TV Smart TV',
       'DefaultWidevineAndroidPhone',
       'Netflix Chrome MAC (Cadmium) HTML 5', 'LG 2014 LG SoC Smart TV',
       'Android DefaultWidevineL3Phone Android Phone',
       'LG 2015 LG SoC DTV Smart TV',
       'Android DefaultWidevineL3Tablet Android Tablet',
       'LG 2019 RTK K5Lp Standard UHD TV Smart TV',
       'LG 2018 Mstar M3 Standard DTV Smart TV', 'Firefox PC (Cadmium)'],
      dtype=object)
In [ ]:
df = df.replace(to_replace=["Samsung CE 2021 Nike-L UHD TV  Smart TV", "LG 2017 RTK K3Lp Standard UHD TV Smart TV", "LG 2014 LG SoC Smart TV", "LG 2015 LG SoC DTV Smart TV", "LG 2019 RTK K5Lp Standard UHD TV Smart TV", "LG 2018 Mstar M3 Standard DTV Smart TV"], value="TV")
df = df.replace(to_replace=['Apple iPhone 13', 'DefaultWidevineAndroidPhone', 'Android DefaultWidevineL3Phone Android Phone'], value='Phone')
df = df.replace(to_replace=['Chrome PC (Cadmium)', 'Netflix Chrome MAC (Cadmium) HTML 5', 'Firefox PC (Cadmium)'], value='Web Browser')
df = df.replace(to_replace='Android DefaultWidevineL3Tablet Android Tablet', value='Tablet')

fig2 = px.histogram(df, x='Profile Name', y='Duration', color='Device Type', text_auto=True, title='Watch time on different devices in hours.')
fig2.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig2.show(renderer='notebook')

Next let's graph hours watched each day since the creation of this profile. First let's create a new data frame that will hold each date and watch time duration. Then we can group this data by date and store it in our new data frame.¶

In [ ]:
df2 = pd.DataFrame()
df2['Date'] = pd.to_datetime(df['Start Time']).dt.date
df2['Duration'] = df['Duration']
watch_time = df2.groupby(['Date'])['Duration'].sum().reset_index()
df2 = pd.DataFrame(watch_time)

fig3 = px.line(df2, x='Date', y='Duration', title='Total time spent watching per day for all profiles in hours.')
fig3.show(renderer='notebook')

Let's figure out which weekdays were the most popular for watching Netflix. I'm reusing previous graphs data frame and converting 'Date' column back to datetime format so we can extract weekday names out of those dates.¶

In [ ]:
df2['Date'] = pd.to_datetime(df2['Date'])
df2['Weekday'] = df2['Date'].dt.day_name()
fig4 = px.histogram(df2, x='Weekday', y='Duration', title='Time spent watching per weekday in hours.')
fig4.show(renderer='notebook')

Next I'd like to see the most watched shows for all profiles. I could also do this for each individual profile but I feel like that would be a bit invasive on their privacy because not everyone would want to share this with others.¶

In [ ]:
df['Title'] = df['Title'].astype(str).str.split(':').str[0] #Modify original df to remove everything after ':' in titles.

show_watch_time = df.groupby(['Title'])['Duration'].sum().reset_index()
show_watch_time = show_watch_time.sort_values(by='Duration', ascending=False).head(25)

df3 = pd.DataFrame(show_watch_time)
df3 = df3.sample(frac=1)#Randomize row order to make graph more visually pleasing

fig5 = px.scatter(df3, x='Title', y='Duration', size='Duration', color='Duration', title='Top 25 most watched shows on all profiles.')
fig5.show(renderer='notebook')

There's more research that could be done for individual profiles like "What time of day is their favorite for watching shows". We could also group titles based on their type (Movie/TV Show) using some kind of third party API or add genres for titles so we can analyse favorites for each user. But for now I think I've gathered enough insights out of this data set.¶